class: center, middle, inverse, title-slide # Introduction to ggplot2 ### Antoine & Nicolas ### cynkra GmbH ### February 24th, 2022 --- <style type="text/css"> .remark-code { font-size: 12px; } .font17 { font-size: 17px; } .font14 { font-size: 14px; } </style> # Basics for visualisation in R using {ggplot2} * {ggplot2} is the most used R visualization package * "gg" stands for the "grammar of graphics" * It is shipped with {tidyverse} ```r library(tidyverse) ``` ``` ## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ── ``` ``` ## ✓ ggplot2 3.3.5 ✓ purrr 0.3.4 ## ✓ tibble 3.1.6 ✓ dplyr 1.0.8.9000 ## ✓ tidyr 1.1.4 ✓ stringr 1.4.0 ## ✓ readr 2.1.0 ✓ forcats 0.5.1 ``` ``` ## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ── ## x dplyr::filter() masks stats::filter() ## x dplyr::lag() masks stats::lag() ``` ??? In the {tidyverse} the standard package for visualization is {ggplot2}. The functions of this package follow a quite unique logic (the "grammar of graphics") and therefore require a special syntax. In this section we want to give a short introduction, how to get started with {ggplot2}. --- # Grammar of graphics * Grammar: A set of structural rules which help define and establish the component of a language * Grammar of graphics : framework to describe the components of any graphics With ggplot we don't create different plots by calling completely different functions, instead we recognize the universal quality of some components. A more general approach means there's a learning curve, but it pays off! --- # Our dataset ```r mpg ``` ``` ## # A tibble: 234 × 11 ## manufacturer model displ year cyl trans drv cty hwy fl class ## <chr> <chr> <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr> ## 1 audi a4 1.8 1999 4 auto(l5) f 18 29 p compa… ## 2 audi a4 1.8 1999 4 manual(m5) f 21 29 p compa… ## 3 audi a4 2 2008 4 manual(m6) f 20 31 p compa… ## 4 audi a4 2 2008 4 auto(av) f 21 30 p compa… ## # … with 230 more rows ``` * displ : engine displacement, in litres * cyl : number of cylinders * hwy : highway miles per gallon * drv: drive train, f = front-wheel drive, r = rear wd, 4 = 4wd * cty : city miles per gallon * class : "type" of car --- # Creating the plot skeleton The `ggplot()` function is used to set up the chart ```r ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) ``` 1. Choose dataset (argument `data`) : `mpg` 2. Map variables to visual properties (argument `mapping`) : - map `displ` to `x` (horizontal coord) - map `hwy` to `y` (horizontal coord) 3. Create a "ggplot" object containing this information, that can be enhanced to display a useful chart when it's printed --- # Creating the plot skeleton We map `displ` to the `x` coordinate, and `hwy` to the `y` coordinate, but we haven't mentioned what we want to see yet! ```r ggplot(mpg, aes(x = displ, y = hwy)) ``` <!-- --> ??? This created only an empty plot, because we did not tell {ggplot2} which geometry we want to use to display the variables we set in the `ggplot()` call. We do this by adding (with the help of the `+` operator after the `ggplot()`-call) a different function starting with `geom_` to provide this information. --- # Adding a layer To create a useful chart we need to add layers * There are different types of layers * Layers are added using the `+` operator * `geom_point()` creates a scatter plot layer ```r ggplot(mpg, aes(x = displ, y = hwy)) + geom_point() ``` ??? This is maybe the most basic plot you can create. To map a different variable than `disp` to the x-axis, change the respective variable name in the `aes()` argument. --- # Adding a layer ```r ggplot(mpg, aes(x = displ, y = hwy)) + geom_point() ``` <!-- --> * `geom_*()` functions define the **geometry** of a layer * Visual properties `x` and `y` are called aesthetics, different geoms recogize different aesthetics ??? This is maybe the most basic plot you can create. To map a different variable than `disp` to the x-axis, change the respective variable name in the `aes()` argument. --- # Teaser #1 We have more aesthetics! ```r ggplot(mpg, aes(x = displ, y = hwy, color = drv)) + geom_point() ``` <!-- --> --- # Teaser #2 We can combine "geom" layers ```r ggplot(mpg, aes(x = displ, y = hwy, color = drv)) + geom_point() + geom_smooth() ``` <!-- --> --- # Teaser #3 And we can do much more! ```r ggplot(mpg, aes(x = displ, y = hwy, color = drv)) + geom_point() + geom_smooth(se = FALSE) + facet_wrap(vars(drv)) + theme_bw() + labs(title = "Highway miles per gallon vs engine displacement", subtitle = "for different drive train types") ``` <!-- --> ??? Always good to have: The *ggplot2* cheatsheet (https://github.com/rstudio/cheatsheets/blob/master/data-visualization-2.1.pdf). --- # `geom_*` functions "geom" = geometric object, see documentation : https://ggplot2.tidyverse.org/reference/#section-layer-geoms ```r ls("package:ggplot2", pattern = "^geom_") ``` ``` ## [1] "geom_abline" "geom_area" "geom_bar" ## [4] "geom_bin_2d" "geom_bin2d" "geom_blank" ## [7] "geom_boxplot" "geom_col" "geom_contour" ## [10] "geom_contour_filled" "geom_count" "geom_crossbar" ## [13] "geom_curve" "geom_density" "geom_density_2d" ## [16] "geom_density_2d_filled" "geom_density2d" "geom_density2d_filled" ## [19] "geom_dotplot" "geom_errorbar" "geom_errorbarh" ## [22] "geom_freqpoly" "geom_function" "geom_hex" ## [25] "geom_histogram" "geom_hline" "geom_jitter" ## [28] "geom_label" "geom_line" "geom_linerange" ## [31] "geom_map" "geom_path" "geom_point" ## [34] "geom_pointrange" "geom_polygon" "geom_qq" ## [37] "geom_qq_line" "geom_quantile" "geom_raster" ## [40] "geom_rect" "geom_ribbon" "geom_rug" ## [43] "geom_segment" "geom_sf" "geom_sf_label" ## [46] "geom_sf_text" "geom_smooth" "geom_spoke" ## [49] "geom_step" "geom_text" "geom_tile" ## [52] "geom_violin" "geom_vline" ``` --- # `geom_*` functions The most popular ones are - `geom_point()` - `geom_line()` - `geom_bar()` - `geom_col()` - `geom_histogram()` - `geom_boxplot()` - `geom_text()` / `geom_label()` --- # `geom_point()` We know a bit about it now, we'll use it to show case some features that can be generalized to other features. --- # `geom_point()` In `?geom_point` we find which aesthetics are available : - `x` - `y` - `alpha` - `colour` (or `color`) - `fill` - `group` - `shape` - `size` - `stroke` --- # `geom_point()` We can run `vignette("ggplot2-specs")` to know more about them, the main ones are: - `x` (required) : horizontal coordinate - `y` (required) : vertical coordinate - `colour` (or `color`) : color of contour or point - `shape` : shape of the point - `size` : size of the point - `alpha` : opacity (low = more transparent) --- # `geom_point()` `color` **mapping** To map a variable to an aesthetics we use `aes()` in the `mapping` argument of `ggplot()` ```r ggplot(mpg, aes(x = displ, y = hwy, color = drv)) + geom_point() ``` <!-- --> --- # `geom_point()` `color` **setting** We set an aesthetic to a constant value if we provide it in the geom function as an independent parameter. ```r ggplot(mpg, aes(x = displ, y = hwy)) + geom_point(color = "blue") ``` <!-- --> --- # `geom_point()` `color` mapping or setting ? Oops! what happened here ? ```r ggplot(mpg, aes(x = displ, y = hwy, color = "blue")) + geom_point() ``` <!-- --> --- # `geom_point()` Some aesthetis like `color` can take both discrete and continuous values ```r ggplot(mpg, aes(x = displ, y = hwy, color = cty)) + geom_point() ``` <!-- --> --- # `geom_point()` `shape` should be mapped to a discrete variable ```r ggplot(mpg, aes(x = displ, y = hwy, color = drv, shape = drv)) + geom_point() ``` <!-- --> --- # `geom_point()` but `shape` can also be defined with the same symbol for all points in the plot ```r ggplot(mpg, aes(x = displ, y = hwy, color = drv)) + geom_point(shape = "triangle") # we can also provide an integer value ``` <!-- --> --- # `geom_point()` `size` should be mapped to a continuous variable ```r ggplot(mpg, aes(x = displ, y = hwy, color = drv, size = cty)) + geom_point() ``` <!-- --> --- # `geom_point()` We can also set a specific `size` for all points in the plot ```r ggplot(mpg, aes(x = displ, y = hwy, color = drv)) + geom_point(size = 8) ``` <!-- --> --- # `geom_point()` `alpha` is most useful when set as a constant, to improve visualizations when overplotting ```r ggplot(mpg, aes(x = displ, y = hwy, color = drv)) + geom_point(size = 8, alpha = 0.3) ``` <!-- --> --- # `geom_point()` `geom_*()` functions have a `position` argument to adjust the position of geometric objects. A common use for `geom_point()` is `position = "jitter"`, also good to handle overplotting! ```r ggplot(mpg, aes(x = displ, y = hwy, color = drv)) + geom_point(position = "jitter") ``` <!-- --> --- # `geom_point()` `position = "jitter"` and `position = position_jitter()` are equivalent, but `position_jitter()` offers more control through arguments ```r ggplot(mpg, aes(x = displ, y = hwy, color = drv)) + geom_point(position = position_jitter(width = .5, height = .5)) ``` <!-- --> --- # Exercise Using the `iris` dataset available in R by default and the knowledge from the previous slides, create a scatter plot using 2 numeric columns, and use `Species` as a color. You can use the pattern we used above. ```r ggplot(mpg, aes(x = displ, y = hwy, color = drv)) + geom_point() ``` <!-- --> --- # `geom_line()` using `?geom_line` we find which aesthetics are available, the main ones are: - `x` (required) : horizontal coordinate - `y` (required) : vertical coordinate - `alpha` : opacity (low = more transparent) - `colour` (or `color`) : color of the line - `linetype` : type of the line - `size` : size of the line --- # `geom_line()` `color` and `linetype` We can use a constant value for these aesthetics if we provide it in the geom function. ```r ggplot(mpg, aes(x = displ, y = hwy)) + geom_line(color = "blue", linetype = "dotdash") ``` <!-- --> you can see the different line types using `?linetype` --- # `geom_line()` `linetype` should be mapped to a discrete variable ```r ggplot(mpg, aes(x = displ, y = hwy, linetype = drv)) + geom_line() ``` <!-- --> --- # `geom_line()` using `size` you can change how thick are the lines ```r ggplot(mpg, aes(x = displ, y = hwy, linetype = drv)) + geom_line(size=2) ``` <!-- --> --- # `geom_line()` Exercise : Using the `economics` dataset available in R by default and the knowledge from the previous slides, draw a plot using `geom_line` and different aesthetics mapped to variables. Change the color and use a dashed linetype. --- # `geom_bar()` using `?geom_bar` we find which aesthetics are available, the main ones are: you should use only one coordinate `x` or `y` - `x` : use it to create vertical bars - `y` : use it to create horizontal bars - `fill` : define bar colors - `alpha` : opacity (low = more transparent) --- # `geom_bar()` `fill` should be mapped to a discrete variable. When the coordinate (`x` or `y`) and `fill` are using the *same* variable each category has its own color. ```r ggplot(mpg, aes(x = drv, fill = drv)) + geom_bar() ``` <!-- --> --- # `geom_bar()` When the coordinate (`x`or `y`) and `fill` are using a *different* variable, each bar represents the distribution of the variable used in `fill` for each category of the coordinate (`x`or `y`), creating stacked bars. ```r ggplot(mpg, aes(x = drv, fill = class)) + geom_bar() ``` <!-- --> --- # `geom_bar()` `geom_bar()` is often used with `position = "dodge"` (equivalent to `position = position_dodge()`), and this parameter changes the plot from stacked bars to clustered bars. ```r ggplot(mpg, aes(x = drv, fill = class)) + geom_bar(position = "dodge") ``` <!-- --> --- # `geom_bar()` `position = "fill"` is useful too! ```r ggplot(mpg, aes(x = drv, fill = class)) + geom_bar(position = "fill") ``` <!-- --> --- # `geom_bar()` Exercise : Using the `diamonds` dataset and the knowledge from the previous slides, draw a barplot using different aesthetics mapped to variables. Use different positions to compare the plots. --- <!-- Probably it needs more explanation to understand the difference between geom_bar and geom_col --> # `geom_col()` ```r df <- data.frame(category = c("a", "b", "c"), value = c(2.3, 1.9, 3.2)) ggplot(df, aes(x = category, y = value)) + geom_col(fill = "darkblue") ``` <!-- --> --- # `geom_histogram()` .pull-left[ ```r ggplot(mpg, aes(x = displ)) + geom_histogram(fill = "darkblue", bins = 10) ``` <!-- --> ] .pull-right[ ```r ggplot(mpg, aes(x = displ, fill = drv)) + geom_histogram(bins = 10) ``` <!-- --> ] --- # `geom_boxplot()` `geom_boxplot()` needs one variable to be of class `character` or `factor` (better) to initiate the grouping. ```r ggplot(mpg, aes(x = class, y = displ)) + geom_boxplot() ``` <!-- --> --- # `geom_text()`/`geom_label()` These work a lot like `geom_point()`, but with the added aesthetic `label` ```r avgs <- mpg %>% group_by(drv) %>% summarize(displ = mean(displ), hwy = mean(hwy)) ggplot(avgs, aes(x = displ, y = hwy, label = drv, color = drv)) + geom_label() ``` <!-- --> --- # `geom_text()`/`geom_label()` These work a lot like `geom_point()`, but with the added aesthetic `label` ```r avgs <- mpg %>% group_by(drv) %>% summarize(displ = mean(displ), hwy = mean(hwy)) ggplot(avgs, aes(x = displ, y = hwy, label = drv)) + geom_text(color = "black") ``` <!-- --> --- # Inheritance `geom_*` functions need data and aesthetics, if they are not provided directly they are inherited fron the object created by `ggplot()`. This is what we've done so far and is mostly recommended but the following works ```r ggplot() + geom_point(aes(x = displ, y = hwy, color = drv), mpg) ``` <!-- --> --- # Inheritance A use case ```r ggplot(mpg, aes(x = displ, y = hwy, color = drv)) + geom_point() + geom_label(data = avgs, aes(label = drv), nudge_y = 3, show.legend = FALSE) ``` <!-- --> --- # Summary : What makes a layer ? A layer combines : * **data**, provided through the `data` argument * A **geometric object**, that we define by choosing the relevant `geom_*()` function * The **mapping** between data variables and visual components ("aesthetics"), provided through the `mapping` argument * A **statistical transformation** to apply on variables, provided by the `stat` argument (usually defaults are what we need, not covered in this course) * A **position** adjustment in case of overlapping data, provided through the `position` argument and `position_*()` functions --- # Legend, and labels Using `labs` you can set the title/subtitle and also add a caption and a tag. .pull-left[ ```r ggplot( data = mpg, mapping = aes(x = displ, y = hwy, color = class) ) + geom_point() + labs( x = "Displacement", y = "Highway mileage\n[miles per gallon]", color = "Car class", title = "Highway mileages depending on displacement", subtitle = "By car class" ) ``` ] -- .pull-right[ <!-- --> ] --- # Legend, and labels Exercise : Using the `iris` dataset and the knowledge from the previous slides, draw a boxplot using different aesthetics mapped to variables, and also change the labels and title. --- # Legend, and labels There are often several ways of doing thing in {ggplot2} .pull-left[ ```r ggplot( data = mpg, mapping = aes(x = displ, y = hwy, color = class) ) + geom_point() + xlab("Displacement") + ylab("Highway mileage\n[miles per gallon]") + ggtitle("Highway mileages depending on displacement", "By car class") + scale_color_discrete(guide = guide_legend("Car class")) # more complex, more control! ``` ] .pull-right[ <!-- --> ] --- # `scale_*()` functions Scales control the details of how data values are translated to visual properties There is a plethora of such functions, their name follow the `scale_<aes>_*()` pattern. For example to change the scale of an axis to a logarithmic scale: .pull-left[ ```r ggplot( data = mpg, mapping = aes(x = displ, y = hwy, color = class) ) + geom_point() + scale_x_log10() ``` ] More: https://stackoverflow.com/questions/70942728 -- .pull-right[ <!-- --> ] --- # Facetting “Facetting” denotes an idea of dividing a graphic into sub-graphics based on the (categorical) values of one or more variables of a dataset. Therefore, each sub-graphic shows a plot for a subset of the data. `facet_grid()` and `facet_wrap()` provide 2 ways of facetting: .pull-left[ ```r facet_grid(facets = vars(<variable>), scales = "fixed", ...) facet_wrap(rows = vars(<variable>), cols = vars(<variable>), scales = "fixed", ...) ``` ] .pull-right[ <img src="data:image/png;base64,#fig/facets.png" width="800"/> ] --- # `facet_wrap` function with `?facet_wrap` we can find the different arguments to change our plot. - `facets` (required) : set of categorical variables - `dir` : `"h"` (default) for horizontal, or `"v"`, for vertical orientation. - `nrow` or `ncol` : Number of rows and columns. --- # `facet_wrap` direction .pull-left[ ```r ggplot(mpg, aes(displ, hwy)) + geom_point(aes(colour = drv)) + facet_wrap(vars(year)) ``` <!-- --> ] .pull-right[ ```r ggplot(mpg, aes(displ, hwy)) + geom_point(aes(colour = drv)) + facet_wrap(vars(year), dir = "v") ``` <!-- --> ] --- # `facet_wrap` `nrow` or `ncol` .pull-left[ ```r ggplot(mpg, aes(displ, hwy)) + geom_point(aes(colour = drv)) + facet_wrap(vars(class)) ``` <!-- --> ] .pull-right[ ```r ggplot(mpg, aes(displ, hwy)) + geom_point(aes(colour = drv)) + facet_wrap(vars(class), nrow = 2) ``` <!-- --> ] --- # `facet_grid` function While `facet_wrap()` tries to act smart and hide non-existing combinations of sub-plots, `facet_grid()` will create a full matrix of sub-plots for all possible combinations. Most of the time when using only one categorical variable, `facet_wrap()` does a good job and is preferred over `facet_grid()`. However, `facet_grid` might be preferred in the following cases: - when faceting over >= 2 variables - when plots of empty combinations should be shown Let’s compare `facet_grid` and `facet_wrap` for 2 grouping variables. --- # `facet_grid` for 2 grouping variables ```r ggplot(mpg, aes(displ, hwy)) + geom_point(aes(colour = class)) + facet_grid(vars(year), vars(cyl)) ``` <!-- --> --- # `facet_wrap` for 2 grouping variables ```r ggplot(mpg, aes(displ, hwy)) + geom_point(aes(colour = class)) + facet_wrap(vars(year, cyl)) ``` <!-- --> --- # Themes The `theme()` function is used to customize the non-data components of your plots: * titles * labels * fonts * background * gridlines * legends --- # Themes `theme()` has a lot of arguments (95!), which themselves often need to be defined using other functions. You'll have to read the doc! OR use the simpler `theme_*()` wrapper functions documented in `?ggtheme`. These have good defaults and a limited amount of arguments easy to specify. ```r ggplot(mpg, aes(x = displ, y = hwy, color = drv)) + geom_point() + theme_classic() ``` <!-- --> --- # Themes `theme()` has a lot of arguments (95!), which themselves often need to be defined using other functions. You'll have to read the doc! OR use the simpler `theme_*()` wrapper functions documented in `?ggtheme`. These have good defaults and a limited amount of arguments easy to specify. ```r ggplot(mpg, aes(x = displ, y = hwy, color = drv)) + geom_point() + theme_light() ``` <!-- --> --- # Themes `theme()` has a lot of arguments (95!), which themselves often need to be defined using other functions. You'll have to read the doc! OR use the simpler `theme_*()` wrapper functions documented in `?ggtheme`. These have good defaults and a limited amount of arguments easy to specify. ```r ggplot(mpg, aes(x = displ, y = hwy, color = drv)) + geom_point() + theme_minimal() ``` <!-- --> --- # Export & saving The default way to export plots is by using `ggsave()`. It differs slightly from other “exporting” functions in R because it comes with some smart defaults (see `?ggsave()`) .pull-left[ ```r ggplot(mtcars, aes(mpg, wt)) + geom_point() ``` <!-- --> ] -- .pull-right[ ```r ggsave("mtcars.pdf") ``` ```r ggsave("mtcars.png") ``` ] --- ## Extensions Many package extend {ggplot2} with new themes, palettes, geoms, and even ways to make your charts interactive or animated. A selection : {ggforce}: https://ggforce.data-imaginist.com/ {patchwork}: https://patchwork.data-imaginist.com/ {ggtext}: https://github.com/clauswilke/ggtext {ggiraph}: http://davidgohel.github.io/ggiraph {plotly}: https://plotly-r.com/ {ggbeeswarm}: https://github.com/eclarke/ggbeeswarm {esquisse}: https://dreamrs.github.io/esquisse --- # Extensions You can find more extensions in: [exts.ggplot2.tidyverse.org](https://exts.ggplot2.tidyverse.org/gallery/) <img src="data:image/png;base64,#fig/ggplot_ext.png" width="800"/> --- # Extensions: {esquisse} {esquisse} is a fantastic tool to create charts from a UI, copy and paste into your own code and learn a lot about ggplot. Reproduce this plot with `esquisse::esquisser(palmerpenguins::penguins)` ```r ggplot(palmerpenguins::penguins) + aes(x = island, y = body_mass_g, fill = species) + geom_boxplot() + theme_classic() + theme(legend.position = "bottom") ``` <!-- --> --- # Extensions: {patchwork} It combines separated plots, and the data source can be different. .pull-left[ ```r library(patchwork) p1 <- ggplot(mpg, aes(x = displ, y = hwy)) + geom_point() + labs( title = "Plot using mpg dataset" ) p2 <- ggplot(economics, aes(x = date, y = pce)) + geom_line() + labs( title = "Plot using economics dataset" ) p1 + p2 ``` ] .pull-right[ <!-- --> ] --- # Extensions: {patchwork} The combination of plots can have different layouts: .pull-left[ ```r p1 <- ggplot(mpg, aes(x = displ, y = hwy)) + geom_point() p2 <- ggplot(economics, aes(x = date, y = pce)) + geom_line() p3 <- ggplot(mpg, aes(x = displ)) + geom_histogram(fill = "darkblue", bins = 10) p4 <- ggplot(mpg, aes(x = class, y = displ)) + geom_boxplot() (p1 | p2 | p3) / p4 ``` ] .pull-right[ <!-- --> ]